
    Toward Tweets Normalization Using Maximum Entropy

    Abstract: The use of social network services and microblogs, such as Twitter, has created valuable text resources, which contain extremely noisy text. Twitter messages contain so much noise that it is difficult to use them in natural language processing tasks. This paper presents a new approach using the maximum entropy model for normalizing Tweets. The proposed approach addresses words that are unseen in the training phase: although the maximum entropy model needs a training dataset to adjust its parameters, the approach can also normalize words that do not appear in the training set. The principle of maximum entropy emphasizes incorporating the available features into a uniform model. First, we generate a set of normalized candidates for each out-of-vocabulary word based on lexical, phonemic, and morphophonemic similarities. Then, three different probability scores are calculated for each candidate using positional indexing, a dependency-based frequency feature, and a language model. After the optimal values of the model parameters are obtained in a training phase, the model calculates the final probability value for each candidate. The approach achieved a BLEU score of 83.12 when tested on 2,000 Tweets. Our experimental results show that the maximum entropy approach significantly outperforms previous well-known normalization approaches.
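
    The candidate-scoring step described above lends itself to a log-linear (maximum entropy) formulation: each candidate receives a weighted sum of feature scores, and a softmax turns these sums into probabilities. The sketch below is illustrative only; the feature values, weights, and function names are assumptions and not the paper's actual implementation, though the three feature types (positional indexing, dependency-based frequency, language model) follow the abstract.

    ```python
    import math

    # Illustrative maximum entropy (log-linear) candidate ranking.
    # Feature scores and weights here are made up for demonstration.

    def maxent_rank(candidates, feature_scores, weights):
        """Return candidates sorted by P(candidate | OOV word).

        candidates     -- list of normalized-word candidates
        feature_scores -- dict: candidate -> {feature name: score}
        weights        -- dict: feature name -> learned weight (lambda)
        """
        # Linear score: sum_i lambda_i * f_i(candidate)
        raw = {
            c: sum(weights[f] * v for f, v in feature_scores[c].items())
            for c in candidates
        }
        # Softmax normalization gives the maximum entropy probability.
        z = sum(math.exp(s) for s in raw.values())
        probs = {c: math.exp(s) / z for c, s in raw.items()}
        return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)


    # Toy example for the OOV token "tmrw" (all numbers are assumed).
    candidates = ["tomorrow", "tumor", "tram"]
    feature_scores = {
        "tomorrow": {"positional": 0.8, "dependency": 0.7, "lm": 0.9},
        "tumor":    {"positional": 0.2, "dependency": 0.1, "lm": 0.3},
        "tram":     {"positional": 0.1, "dependency": 0.2, "lm": 0.2},
    }
    weights = {"positional": 1.2, "dependency": 0.9, "lm": 1.5}

    print(maxent_rank(candidates, feature_scores, weights)[0])  # ('tomorrow', ...)
    ```

    In a real system the weights would be fitted on annotated training data rather than set by hand, which is what the training phase in the abstract refers to.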

    Corpus-driven Malay language tweet normalization / Mohammad Arshi Saloot

    The expeditious spread of blogs, microblogs, and social network services has accelerated the use of casual written language, known as user-generated content (UGC). UGC diverges from standard writing conventions because of coding strategies such as phonetic transcriptions (are → r), digit phonemes (me too → me2), misspellings (misappropriate → missapropriate), vowel drops (double → dble), and missing or incorrect punctuation marks (In that situation, I'd possibly come. → In that situation Id possibly come). These modifications are due to three primary factors: 1) limited message length (e.g. 140 characters per Tweet); 2) miniature keyboards; and 3) extensive use of UGC in unofficial and informal communication. The presence of many out-of-vocabulary (OOV) words, also known as unknown words, substantially disturbs standard natural language processing (NLP) systems. Therefore, NLP research has increasingly focused on the text normalization task, in which OOV words are converted into their context-appropriate standard words. While diverse normalization approaches exist for English, the problem has been neglected in other languages, such as Malay. In this work, Malay is chosen because of its considerable use on Twitter, where it is the fourth most used language. Thus, a rule-based approach to normalizing Malay Twitter messages is proposed based on corpus-driven analysis. The corpus-driven analysis relies on frequency information in the form of word-frequency lists, concordances, clusters, and keywords. To design the normalization system, three analyses were performed on a Malay Twitter corpus and a standard Malay corpus: 1) frequency of unknown words; 2) abbreviation patterns; and 3) letter repetition. A Malay Twitter corpus known as the Malay Chat-style Corpus (MCC) is constructed. The MCC encompasses 1 million Twitter messages and consists of 14,484,384 word instances, 646,807 unique vocabulary items, and metadata such as the Twitter client application used, posting time, and type of Twitter message (simple Tweet, Retweet, Reply). To build the MCC, which represents Malay Twitter lingo, the following corpus-compilation criteria were considered: sampling, representativeness, machine readability, balance, and size of data. A portion of the MCC is manually annotated for use in the development and testing stages of the normalization system. The architecture of the Malay normalization system contains seven primary modules: (1) enhanced tokenization; (2) In-Vocabulary (IV) detection; (3) colloquial dictionary lookup; (4) repeated letter elimination; (5) abbreviation normalization; (6) English word translation; and (7) de-tokenization. The normalization modules are formulated based on the results of the MCC analysis and implemented as rule-based state machines. An evaluation is performed in terms of BLEU score to measure the accuracy of the system. The result is encouraging: a BLEU score of 0.91 is achieved against a baseline score of 0.46. To compare the accuracy of the system with probabilistic approaches on an identical Malay dataset, a statistical machine translation (SMT) normalization system is implemented, trained, and evaluated. The experimental results show that higher accuracy is achieved by the proposed architecture, which is designed based on the results of our corpus-driven analysis.
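
    To make the module pipeline concrete, the following sketch chains three of the seven modules named above (In-Vocabulary detection, colloquial dictionary lookup, repeated letter elimination) over whitespace tokens. The word lists and Malay examples are toy assumptions standing in for the MCC-derived resources, and the tokenization, abbreviation, translation, and de-tokenization modules are omitted; this is not the thesis's actual implementation.

    ```python
    import re

    # Toy resources standing in for the MCC-derived dictionaries (assumed data).
    IV_WORDS = {"saya", "tidak", "makan", "sudah", "lagi"}            # standard Malay words
    COLLOQUIAL = {"x": "tidak", "dah": "sudah", "mkn": "makan"}       # colloquial -> standard

    def eliminate_repeats(token):
        """Collapse letter repetitions, e.g. 'bessst' -> 'best'.

        Runs of three or more identical letters are collapsed to one,
        mirroring the repeated-letter-elimination module.
        """
        return re.sub(r"(.)\1{2,}", r"\1", token)

    def normalize_token(token):
        t = token.lower()
        if t in IV_WORDS:                   # In-Vocabulary detection
            return t
        if t in COLLOQUIAL:                 # colloquial dictionary lookup
            return COLLOQUIAL[t]
        collapsed = eliminate_repeats(t)    # repeated letter elimination
        if collapsed in IV_WORDS:
            return collapsed
        return COLLOQUIAL.get(collapsed, collapsed)

    def normalize_tweet(text):
        """Apply token-level normalization to a whitespace-tokenized Tweet."""
        return " ".join(normalize_token(tok) for tok in text.split())

    print(normalize_tweet("saya x mkn laaagi"))  # -> "saya tidak makan lagi"
    ```

    Each rule here is a simple lookup or regular expression; in the described system the equivalent behaviour is expressed as rule-based state machines derived from the corpus analysis.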